A case against Active-Active firewall clusters

Published: 2023-11-27

Over the years, I have seen more and more posts on the Internet from network engineers asking whether their new network design should use a firewall cluster that is Active-Passive (A-P) or Active-Active (A-A). My response is to always go with A-P, and I figured I would take this opportunity to go into some detail on why I think it's the best solution. This article is based on the FortiGate firewall appliance, but I believe these concepts can be generalized to other vendors as well.

We paid lots of money for two appliances, why not utilize both of them?

This is typically what the active-active argument boils down to: if you are paying for two expensive devices, then it makes sense to use both of them to get the maximum return on your investment, right? Well, there's a problem.

The problem

It is harder to ensure redundancy with two active devices. The point of a cluster is to achieve higher uptime than one device can realistically provide. If the active cluster member suddenly fails, the passive member gracefully takes over operation. This failover is often seamless, with virtually no impact on traffic flowing through the cluster. The faulty member can then be replaced and later rejoin the cluster.

This behavior is typically guaranteed in an A-P cluster, as the passive member stays passive until a failover is triggered. After the failover, all traffic that was handled by the active member is handled by the formerly passive member.

We cannot guarantee this behavior in an A-A cluster, as every member is already handling its share of the traffic passing through the cluster. This may lead to a situation where the combined load on the members of the cluster is higher than a single member can handle. This is visualized in the diagram below:

From the above diagram we can see that the 75% load on the active member of the A-P cluster equals the total load on the cluster. 10% is used by the system to run the OS while the remaining 65% is allocated to user traffic passing through the firewall cluster. The passive member is not processing user traffic, so total load on the cluster is 75%.

In the A-A cluster, the load on each member is 50%, of which 40% is user traffic. If one member fails, the remaining member must handle both members' user traffic (2 × 40%) plus its own 10% OS overhead, putting it at 90% load. This would trigger what Fortinet calls Conserve Mode, a state that is reached when the load on the firewall exceeds 85%. In this state, the firewall is forced to take immediate action to stop itself from becoming unresponsive. One such action is to stop accepting new sessions from users.
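
The arithmetic above can be sketched in a few lines of Python. This is a deliberately simplified model and not something FortiOS provides: it assumes load is a single percentage that adds linearly, and it reuses the 10% OS overhead and 85% threshold from this article. The function name `load_after_failover` and the constants are made up for illustration.

```python
# Simplified model: load is one percentage figure and adds linearly.
OS_OVERHEAD = 10         # percent used by the OS on a member (figure from this article)
CONSERVE_THRESHOLD = 85  # percent above which Conserve Mode is assumed to trigger


def load_after_failover(user_traffic_per_member):
    """Return the load on the single surviving member after the others fail.

    user_traffic_per_member: user traffic load (percent) carried by each
    member before the failure; use 0 for a passive member.
    """
    # The survivor runs its own OS and inherits all user traffic.
    return OS_OVERHEAD + sum(user_traffic_per_member)


# A-P: the active member carries 65%, the passive member carries nothing.
ap_load = load_after_failover([65, 0])
# A-A: each member carries 40%.
aa_load = load_after_failover([40, 40])

for name, load in (("A-P", ap_load), ("A-A", aa_load)):
    state = "Conserve Mode" if load > CONSERVE_THRESHOLD else "OK"
    print(f"{name}: {load}% after failover ({state})")
```

With the numbers from the examples, the A-P cluster lands at 75% and stays below the threshold, while the A-A cluster lands at 90% and crosses it.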

So, by using an A-A cluster we are more likely to negate the original purpose of the cluster: to provide redundancy. As soon as one member fails, the remaining member may become overwhelmed and either fail as well or stay online with heavily throttled traffic.

One could argue that my examples are a bit unfair. The A-P example only has a user traffic load of 65%, while the A-A example has a combined user traffic load of 80%. An A-P cluster would not be able to handle an 80% user traffic load either, but since the entire load sits on the one active member, the network operator realistically has a better chance of noticing the load increasing over time and acting accordingly, be that upgrading the cluster to a larger model or disabling some security inspection features as a temporary measure.
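
One way to make this kind of monitoring concrete is to work out how much user traffic a cluster can carry and still survive losing a member. The sketch below uses the same simplified linear model and the figures from this article, and it assumes traffic is redistributed evenly across the surviving members. `max_safe_user_traffic` is a hypothetical helper, not a FortiOS feature.

```python
def max_safe_user_traffic(members, os_overhead=10, threshold=85, failures=1):
    """Total user traffic the cluster can carry and still survive a failure.

    The result is expressed in percent of one member's capacity. It is the
    traffic that keeps every surviving member at or below `threshold` after
    `failures` members have failed, assuming an even split across survivors.
    """
    survivors = members - failures
    if survivors < 1:
        raise ValueError("at least one member must survive")
    # Each survivor can spend (threshold - os_overhead) on user traffic.
    return survivors * (threshold - os_overhead)


# Two-member cluster, one failure tolerated.
two_member_budget = max_safe_user_traffic(2)
print(two_member_budget)      # total user traffic budget for the cluster
print(two_member_budget / 2)  # per-member budget when running A-A
```

Under these assumptions, a two-member cluster has a total user traffic budget of 75%, or 37.5% per member in A-A. The 65% in the A-P example fits within that budget, while the 40% per member in the A-A example does not.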

Thank you for reading, I hope it was worth your time.